Counter-fitting Word Vectors to Linguistic Constraints

Terminology

antonymy: the relation between words with opposite meanings
synonymy: the relation between words with the same or similar meanings
DST: dialogue state tracking task

aim

improve the vectors’ capability for judging semantic similarity

result

  • lead to a new state of the art performance on the SimLex-999 dataset
  • result in robust improvements across different dialogue domains

Introduction

This is a paper from the University of Cambridge and Apple.

Traditional word vectors such as GloVe have two drawbacks, which can be addressed by injecting additional knowledge from external resources.

Drawbacks of learning word embeddings from co-occurrence information in corpora:

  • coalesce the notions of semantic similarity and conceptual association
  • similarity and antonymy can be application- or domain-specific

The paper proposes a method that addresses these two drawbacks by using synonymy and antonymy relations, drawn either from a general lexical resource or from an application-specific ontology, to fine-tune distributional word vectors.

It’s a lightweight post-processing procedure in the spirit of retrofitting.

Most work on improving word vector representation using lexical resources has focused on bringing words which are known to be semantically related closer together in the vector space.

Some methods modify the prior or the regularization of the original training procedure.

The word vectors that achieve the current state-of-the-art performance on SimLex-999 are used as input for counter-fitting in this paper's experiments.

The modelling work closest to this one is Quan Liu's Learning Semantic Word Embeddings Based on Ordinal Knowledge Constraints, which uses antonymy and WordNet hierarchy information to modify the heavyweight Word2Vec training objective.

Counter-fitting Word Vectors to Linguistic Constraints

the original word vectors: $V=(\mathbf{v}_1,\mathbf{v}_2,\dots,\mathbf{v}_N)$

new word vectors: $V^\prime=(\mathbf{v}^\prime_1,\mathbf{v}^\prime_2,\dots,\mathbf{v}^\prime_N)$

A and S are two constraint sets, each made up of word-index pairs (i, j): A holds antonym pairs and S holds synonym pairs. Three objective terms are built from them: the first two push antonyms further apart and pull synonyms closer together, while the third preserves as much of the distributional information of the original space as possible.

Antonym Repel (AR)


$$AR(V^\prime) = \sum_{(u,w) \in A} \tau\left(\delta - d(\mathbf{v}^\prime_u, \mathbf{v}^\prime_w)\right)$$

$d(\mathbf{v}_i,\mathbf{v}_j)=1-\cos(\mathbf{v}_i,\mathbf{v}_j)$ is a distance derived from cosine similarity.

$\tau(x) \triangleq \max(0,x)$

Here $\delta$ is the "ideal" minimum distance between antonymous words; in this paper $\delta=1$. Since $d(\mathbf{v}_i,\mathbf{v}_j) \in [0,2]$, an antonym pair incurs a cost when $d(\mathbf{v}^\prime_u,\mathbf{v}^\prime_w) \in [0,1)$ and no cost when $d(\mathbf{v}^\prime_u,\mathbf{v}^\prime_w) \in [1,2]$, because the two words are already far enough apart.
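To make the AR term concrete, here is a minimal NumPy sketch (not the authors' code); `vectors` is a hypothetical sequence of word vectors indexed by word id, and `antonym_pairs` is a list of (u, w) index pairs.

```python
import numpy as np

def cosine_distance(u, w):
    """d(u, w) = 1 - cos(u, w); lies in [0, 2] for non-zero vectors."""
    return 1.0 - np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))

def hinge(x):
    """tau(x) = max(0, x): only violations of the margin incur a cost."""
    return max(0.0, x)

def antonym_repel(vectors, antonym_pairs, delta=1.0):
    """AR(V'): penalise antonym pairs that are closer than delta."""
    return sum(hinge(delta - cosine_distance(vectors[u], vectors[w]))
               for (u, w) in antonym_pairs)
```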

Synonym Attract (SA)


$$SA(V^\prime) = \sum_{(u,w) \in S} \tau\left(d(\mathbf{v}^\prime_u, \mathbf{v}^\prime_w) - \gamma\right)$$

This is analogous to AR, with the ideal distance between synonyms set to $\gamma=0$. This seems a little strange, though: with $\gamma=0$ we always have $d(\mathbf{v}_u^\prime,\mathbf{v}_w^\prime)-\gamma \geq 0$, so the hinge never clips anything and every synonym pair contributes its full distance as cost. Perhaps $\gamma=1$ would be more reasonable.
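A matching sketch of the SA term (again illustrative, not the authors' code); with the default $\gamma=0$ every synonym pair simply contributes its cosine distance as cost.

```python
import numpy as np

def synonym_attract(vectors, synonym_pairs, gamma=0.0):
    """SA(V'): penalise synonym pairs that are further apart than gamma."""
    def cosine_distance(u, w):
        return 1.0 - np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))
    return sum(max(0.0, cosine_distance(vectors[u], vectors[w]) - gamma)
               for (u, w) in synonym_pairs)
```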

Vector Space Preservation (VSP)


$$VSP(V,V^\prime) = \sum_{i=1}^{N} \sum_{j \in N(i)} \tau\left(d(\mathbf{v}_i^\prime, \mathbf{v}_j^\prime) - d(\mathbf{v}_i, \mathbf{v}_j)\right)$$

This formula also seems strange: because of the hinge, only neighbouring pairs that end up further apart than in the original space incur a cost, so pulling two words closer than they originally were costs nothing.
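A sketch of the VSP term; here the neighbourhood N(i) is assumed to contain every other word within a small cosine-distance radius `rho` of $\mathbf{v}_i$ in the original space (the radius value below is an assumption for illustration).

```python
import numpy as np

def cosine_distance(u, w):
    return 1.0 - np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))

def vector_space_preservation(original, updated, rho=0.2):
    """VSP(V, V'): penalise neighbouring pairs that drift further apart
    than they were in the original space (drifting closer costs nothing)."""
    n = len(original)
    cost = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d_orig = cosine_distance(original[i], original[j])
            if d_orig <= rho:           # j is in the neighbourhood N(i)
                d_new = cosine_distance(updated[i], updated[j])
                cost += max(0.0, d_new - d_orig)
    return cost
```

The double loop is O(N²) and only meant to show the definition; a real implementation would precompute the neighbourhoods once from the original vectors.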

The three objective terms are then combined linearly:


$$C(V,V^\prime) = k_1 AR(V^\prime) + k_2 SA(V^\prime) + k_3 VSP(V,V^\prime)$$
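Putting the pieces together, a sketch of the combined cost, assuming the `antonym_repel`, `synonym_attract`, and `vector_space_preservation` helpers from the sketches above are in scope; the equal weights k1 = k2 = k3 = 0.1 follow my reading of the paper's hyperparameters, and the paper minimises this cost over the new vectors V' with stochastic gradient descent.

```python
def counter_fit_cost(original, updated, antonym_pairs, synonym_pairs,
                     k1=0.1, k2=0.1, k3=0.1):
    """C(V, V') = k1*AR(V') + k2*SA(V') + k3*VSP(V, V')."""
    return (k1 * antonym_repel(updated, antonym_pairs)
            + k2 * synonym_attract(updated, synonym_pairs)
            + k3 * vector_space_preservation(original, updated))
```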

Injecting Dialogue Domain Ontologies into Vector Space Representations

They use an RNN framework that operates directly on the n-gram features extracted from the automated speech recognition hypotheses.

Experiments

Word Vectors and Semantic Lexicons

GloVe and Paragram-SL999 vectors are publicly available.

Constraints are drawn from two lexical resources:

  • PPDB 2.0: only the Equivalence and Exclusion relations are used, and only single-token entries are kept
  • WordNet: its synonym relations are not used (only antonyms are injected)

Vocabulary: the most frequent words, taken from a frequent-word list.

Improving Lexical Similarity Predictions

The evaluation uses Spearman's rank correlation coefficient on the SimLex-999 dataset, which contains word pairs ranked by a large number of annotators who were instructed to consider only semantic similarity.
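To show what this evaluation looks like in practice, here is a small sketch (with hypothetical inputs, not actual SimLex-999 data) that correlates model cosine similarities with human ratings using `scipy.stats.spearmanr`.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vectors, pairs, human_scores):
    """Spearman's rho between model cosine similarities and human ratings."""
    model_scores = [
        float(np.dot(vectors[a], vectors[b])
              / (np.linalg.norm(vectors[a]) * np.linalg.norm(vectors[b])))
        for (a, b) in pairs
    ]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# Hypothetical usage with toy data (not the real dataset):
# vectors = {"cup": np.array([...]), "mug": np.array([...]), ...}
# evaluate_similarity(vectors, [("cup", "mug"), ("cup", "coffee")], [8.5, 3.5])
```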

Retrofitting pre-trained word vectors improves GloVe vectors, but not the already semantically specialised Paragram-SL999. Counter-fitting substantially improves both sets of vectors, showing that injecting antonymy relations goes a long way towards improving word vectors for the purpose of making semantic similarity judgements.

Table 3 shows the effect of injecting different categories of linguistic constraints. Three sets of constraints are considered:

  1. PPDB- (PPDB antonyms)
  2. PPDB+ (PPDB synonyms)
  3. WordNet- (WordNet antonyms)

GloVe vectors benefit from all three sets of constraints. Paragram vectors, which have already been exposed to PPDB, only improve with the injection of WordNet antonyms.

Table 4 shows eight SimLex-999 word pairs whose rankings counter-fitting corrects. Five of the eight pairs do not appear in the sets of linguistic constraints, which shows that secondary (i.e. indirect) interactions through the three terms of the cost function also contribute to the semantic content of the transformed vector space.

Improving Dialogue State Tracking

In the dialogue state tracking experiments, starting from Paragram vectors did not lead to superior performance, which shows that injecting the application-specific ontology is at least as important as the quality of the initial word vectors.
